In the first exercise we installed the required software and connected to the online environment. I had trouble getting RStudio and Git to communicate with each other, so I had to redo the installation several times. After these problems I got to practice in the DataCamp environment, but for some reason the platform appears noticeably smaller than normal, and I have not yet solved this.
The R exercises covered very basic R usage, but near the end we dove into functions rather quickly. In my opinion a few more video lectures on the topic would have been useful. In any case, a very interesting concept. I have not used Git or Markdown before, so it is good to get experience with these as well.
#Data import
data <- read.csv("learing_2019.csv")
#Summary
summary(data)
## X gender Age Attitude Points
## Min. : 1.00 F:110 Min. :17.00 Min. :14.00 Min. : 7.00
## 1st Qu.: 42.25 M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:19.00
## Median : 83.50 Median :22.00 Median :32.00 Median :23.00
## Mean : 83.50 Mean :25.51 Mean :31.43 Mean :22.72
## 3rd Qu.:124.75 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:27.75
## Max. :166.00 Max. :55.00 Max. :50.00 Max. :33.00
## Deep Stra Surf
## Min. :1.583 Min. :1.250 Min. :1.583
## 1st Qu.:3.333 1st Qu.:2.625 1st Qu.:2.417
## Median :3.667 Median :3.188 Median :2.833
## Mean :3.680 Mean :3.121 Mean :2.787
## 3rd Qu.:4.083 3rd Qu.:3.625 3rd Qu.:3.167
## Max. :4.917 Max. :5.000 Max. :4.333
#Access ggplot and GGally
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
#Creating graphics
p1 <- ggplot(data, aes(x = Attitude, y = Points, col = gender))
#and fitting linear model
p2 <- p1 + geom_point() + geom_smooth(method = "lm")
p2
These graphs show that gender plays a role. The biggest difference between the genders appears in the Attitude variable, where females have a clearly lower mean. The distributions of the Attitude and Surf variables also differ between males and females.
#Correlations
p <- ggpairs(data, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
## Regression model with multiple explanatory variables
my_model <- lm(Points ~ Attitude + Stra + Surf, data = data)
# Summary of the model
summary(my_model)
##
## Call:
## lm(formula = Points ~ Attitude + Stra + Surf, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## Attitude 0.33952 0.05741 5.913 1.93e-08 ***
## Stra 0.85313 0.54159 1.575 0.11716
## Surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Only Attitude appears significant in this fitted model. The multiple R-squared measures the proportion of the variation in Points explained jointly by all three explanatory variables. Since it is about 0.21, roughly 21% of the variation in Points is explained by the model, most of it attributable to Attitude.
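As a sanity check of the R-squared interpretation: in the special case of a one-predictor model, multiple R-squared equals the squared correlation between the predictor and the response. A minimal sketch on simulated toy data (not the course data), so it runs standalone:

```r
# Toy data: for a one-predictor linear model, multiple R-squared
# equals the squared correlation between predictor and response.
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
fit <- lm(y ~ x)
c(r_squared = summary(fit)$r.squared, cor_squared = cor(x, y)^2)
```

With several predictors, as in my_model, this identity no longer holds and R-squared reflects the joint fit.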
# drawing diagnostic plots using the plot() function. Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.
par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))
The assumptions of the model are that the error terms are approximately normally distributed with mean 0 and constant variance, and that they are uncorrelated and independent of the explanatory variables. In particular, the size of the error should not depend on the values of the explanatory variables.
From these plots the model assumptions seem approximately satisfied, with the small exception that small and large values of Points show somewhat larger deviations from the estimated mean. A couple of observations also have somewhat high leverage, but overall the assumptions of the model seem to hold quite well. The ends of the spectrum remain questionable, and additional analysis would be needed there.
Mette Nissen, 15.9.2019. Exercise 3: analysis of student alcohol consumption.
#Exercise 3
alc <- read.csv("alc.csv", header = TRUE, sep = ",")
library(ggplot2)
library(GGally)
library(tidyverse)
## ── Attaching packages ───────────────── tidyverse 1.2.1 ──
## ✔ tibble 2.1.3 ✔ purrr 0.3.2
## ✔ tidyr 0.8.3 ✔ dplyr 0.8.3
## ✔ readr 1.3.1 ✔ stringr 1.4.0
## ✔ tibble 2.1.3 ✔ forcats 0.4.0
## ── Conflicts ──────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
library(knitr)
I have taken the variables sex, number of school absences, going out with friends, and final grade into the analysis. The hypothesis is that students performing poorly in school and missing classes are at higher risk of heavier alcohol use.
I also looked at the summary, which shows that the mean age is 17 years. The bar chart shows that high use is more common in men. The next boxplot shows that, in both genders, students with high alcohol use go out with friends more. The final grade also seems to be lower. Plotting absences against age, the last figure shows that absences seem to be higher among older men with high alcohol consumption.
The same can be seen from the mean values. The group has 198 females and 184 males. Dividing by high use, among females the non-high-use versus high-use counts are 156 and 42 respectively, while in men they are 112 and 72, so a higher proportion of men are high users. High use does not affect the final grade in females, but there is a difference among male students: 12.2 in the non-high-use group versus 10.3 in the high-use group. Absences are higher with high use in both genders, and the same holds for going out with friends (see the mean_absences and mean_goout tables).
#Summary of the datatable
kable(summary(alc), digits = 2)
| X | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | nursery | internet | guardian | traveltime | studytime | failures | schoolsup | famsup | paid | activities | higher | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 | alc_use | high_use | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.00 | GP:342 | F:198 | Min. :15.00 | R: 81 | GT3:278 | A: 38 | Min. :0.000 | Min. :0.000 | at_home : 53 | at_home : 16 | course :140 | no : 72 | no : 58 | father: 91 | Min. :1.000 | Min. :1.000 | Min. :0.0000 | no :331 | no :144 | no :205 | no :181 | no : 18 | no :261 | Min. :1.000 | Min. :1.00 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. : 0.0 | Min. : 2.00 | Min. : 4.00 | Min. : 0.00 | Min. :1.000 | Mode :logical | |
| 1st Qu.: 96.25 | MS: 40 | M:184 | 1st Qu.:16.00 | U:301 | LE3:104 | T:344 | 1st Qu.:2.000 | 1st Qu.:2.000 | health : 33 | health : 17 | home :110 | yes:310 | yes:324 | mother:275 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:0.0000 | yes: 51 | yes:238 | yes:177 | yes:201 | yes:364 | yes:121 | 1st Qu.:4.000 | 1st Qu.:3.00 | 1st Qu.:2.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:3.000 | 1st Qu.: 1.0 | 1st Qu.:10.00 | 1st Qu.:10.00 | 1st Qu.:10.00 | 1st Qu.:1.000 | FALSE:268 | |
| Median :191.50 | NA | NA | Median :17.00 | NA | NA | NA | Median :3.000 | Median :3.000 | other :138 | other :211 | other : 34 | NA | NA | other : 16 | Median :1.000 | Median :2.000 | Median :0.0000 | NA | NA | NA | NA | NA | NA | Median :4.000 | Median :3.00 | Median :3.000 | Median :1.000 | Median :2.000 | Median :4.000 | Median : 3.0 | Median :12.00 | Median :12.00 | Median :12.00 | Median :1.500 | TRUE :114 | |
| Mean :191.50 | NA | NA | Mean :16.59 | NA | NA | NA | Mean :2.806 | Mean :2.565 | services: 96 | services:107 | reputation: 98 | NA | NA | NA | Mean :1.448 | Mean :2.037 | Mean :0.2016 | NA | NA | NA | NA | NA | NA | Mean :3.937 | Mean :3.22 | Mean :3.113 | Mean :1.482 | Mean :2.296 | Mean :3.573 | Mean : 4.5 | Mean :11.49 | Mean :11.47 | Mean :11.46 | Mean :1.889 | NA | |
| 3rd Qu.:286.75 | NA | NA | 3rd Qu.:17.00 | NA | NA | NA | 3rd Qu.:4.000 | 3rd Qu.:4.000 | teacher : 62 | teacher : 31 | NA | NA | NA | NA | 3rd Qu.:2.000 | 3rd Qu.:2.000 | 3rd Qu.:0.0000 | NA | NA | NA | NA | NA | NA | 3rd Qu.:5.000 | 3rd Qu.:4.00 | 3rd Qu.:4.000 | 3rd Qu.:2.000 | 3rd Qu.:3.000 | 3rd Qu.:5.000 | 3rd Qu.: 6.0 | 3rd Qu.:14.00 | 3rd Qu.:14.00 | 3rd Qu.:14.00 | 3rd Qu.:2.500 | NA | |
| Max. :382.00 | NA | NA | Max. :22.00 | NA | NA | NA | Max. :4.000 | Max. :4.000 | NA | NA | NA | NA | NA | NA | Max. :4.000 | Max. :4.000 | Max. :3.0000 | NA | NA | NA | NA | NA | NA | Max. :5.000 | Max. :5.00 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :45.0 | Max. :18.00 | Max. :18.00 | Max. :18.00 | Max. :5.000 | NA |
# Graphs
p <- ggpairs(alc, columns = c("sex", "absences","goout", "G3", "high_use"), mapping = aes(col = sex, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
# initialize a barchart of alcohol use difference between genders
g1 <- ggplot(data = alc, aes(x=sex)) + geom_bar() + facet_wrap("high_use") + ggtitle("Student alcohol consumption by sex")
g1
# initialize a plot of alcohol use and going out with friends
g2 <- ggplot() + geom_boxplot(data = alc, aes(x = sex, y = goout)) + facet_wrap("high_use") + ggtitle("Student going out with friends by alcohol consumption and sex")
g2
# Boxplot of alcohol use and school final grade
g3 <- ggplot() + geom_boxplot(data = alc, aes(x = sex, y = G3)) + facet_wrap("high_use") + ggtitle("Student final grades by alcohol consumption and sex")
g3
# Scatterplot showing linear model between age and absences separated by alcohol consumption
g4 <- ggplot(data = alc, aes(x = age, y = absences, color=sex, alpha = 0.3)) + geom_point() + facet_wrap("high_use")+ geom_jitter() + geom_smooth(method = "lm") + ggtitle("Student absences by alcohol consumption, sex and age")
g4
alc %>% group_by(sex) %>% summarise(count = n())
## # A tibble: 2 x 2
## sex count
## <fct> <int>
## 1 F 198
## 2 M 184
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_grade
## <fct> <lgl> <int> <dbl>
## 1 F FALSE 156 11.4
## 2 F TRUE 42 11.7
## 3 M FALSE 112 12.2
## 4 M TRUE 72 10.3
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absences = mean(absences))
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_absences
## <fct> <lgl> <int> <dbl>
## 1 F FALSE 156 4.22
## 2 F TRUE 42 6.79
## 3 M FALSE 112 2.98
## 4 M TRUE 72 6.12
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_goout = mean(goout))
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_goout
## <fct> <lgl> <int> <dbl>
## 1 F FALSE 156 2.96
## 2 F TRUE 42 3.36
## 3 M FALSE 112 2.71
## 4 M TRUE 72 3.93
knitr::opts_chunk$set(echo = TRUE)
# glm() model
m <- glm(high_use ~ absences + sex + goout, data = alc, family = "binomial")
# the summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + sex + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7871 -0.8153 -0.5446 0.8357 2.4740
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.16317 0.47506 -8.764 < 2e-16 ***
## absences 0.08418 0.02237 3.764 0.000167 ***
## sexM 0.95872 0.25459 3.766 0.000166 ***
## goout 0.72981 0.11970 6.097 1.08e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 387.75 on 378 degrees of freedom
## AIC: 395.75
##
## Number of Fisher Scoring iterations: 4
# the coefficients of the model
coef(m)
## (Intercept) absences sexM goout
## -4.16317087 0.08418399 0.95871896 0.72980859
# the odds ratio (OR)
OR <- coef(m) %>% exp
# the confidence interval (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
OR
## (Intercept) absences sexM goout
## 0.01555815 1.08782902 2.60835292 2.07468346
CI
## 2.5 % 97.5 %
## (Intercept) 0.005885392 0.03804621
## absences 1.042458467 1.13933894
## sexM 1.593132148 4.33151387
## goout 1.650182481 2.64111050
# printing out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.01555815 0.005885392 0.03804621
## absences 1.08782902 1.042458467 1.13933894
## sexM 2.60835292 1.593132148 4.33151387
## goout 2.07468346 1.650182481 2.64111050
In the fitted model all three explanatory variables are statistically highly significant. Looking at the ORs and CIs, absences, male sex, and going out are all positively associated with high use, and their confidence intervals stay above 1, indicating a significant association. In an earlier model that also included G3, its association was negative and not significant, so I am leaving G3 out of the model for future predictions.
# probability of high_use
probabilities <- predict(m, type = "response")
# adding the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# using the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = (probability > 0.5))
# the last ten original classes, predicted probabilities, and class predictions
select(alc, goout, absences, sex, high_use, probability, prediction) %>% tail(10)
## goout absences sex high_use probability prediction
## 373 2 0 M FALSE 0.14869987 FALSE
## 374 3 7 M TRUE 0.39514446 FALSE
## 375 3 1 F FALSE 0.13129452 FALSE
## 376 3 6 F FALSE 0.18714923 FALSE
## 377 2 2 F FALSE 0.07342805 FALSE
## 378 4 2 F FALSE 0.25434555 FALSE
## 379 2 2 F FALSE 0.07342805 FALSE
## 380 1 3 F FALSE 0.03989428 FALSE
## 381 5 4 M TRUE 0.68596604 TRUE
## 382 1 2 M TRUE 0.09060457 FALSE
# target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 253 15
## TRUE 65 49
Among the last ten cases the real high_use values are TRUE for 3 and FALSE for 7; the predictions give 1 TRUE and 9 FALSE.
# a plot of 'high_use' versus 'probability' in 'alc'
pr <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
pr2 <- pr + geom_point() + ggtitle("Predictions")
pr2
Here the prediction error of our model is calculated on the training data:
# the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>%prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66230366 0.03926702 0.70157068
## TRUE 0.17015707 0.12827225 0.29842932
## Sum 0.83246073 0.16753927 1.00000000
# a loss function
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2094241
The wrong-prediction proportion of this model, about 21%, is smaller than the 26% obtained in the DataCamp exercise.
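The 21% above is the error on the training data itself; a proper K-fold cross-validation estimate can be obtained with boot::cv.glm using the same loss function. A minimal sketch, run here on simulated data (the variable names x1, x2, y are illustrative, not from alc.csv):

```r
library(boot)  # cv.glm ships with R's recommended 'boot' package

# Simulated stand-in for the alcohol data: two predictors, binary outcome.
set.seed(42)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- rbinom(200, 1, plogis(0.8 * d$x1 - 0.5 * d$x2))

m <- glm(y ~ x1 + x2, data = d, family = "binomial")

# Same loss function as above: proportion of predictions off by more than 0.5.
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# 10-fold cross-validation; delta[1] is the raw cross-validated error.
cv <- cv.glm(data = d, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
```

The cross-validated error is typically slightly larger than the training error, since each fold is scored on data the model did not see.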
# access packages
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(ggplot2)
library(GGally)
library(tidyverse)
library(dplyr)
library(knitr)
library(corrplot)
## corrplot 0.84 loaded
library(tidyr)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
# load the data
data("Boston")
This data frame contains the following columns:
crim = per capita crime rate by town.
zn = proportion of residential land zoned for lots over 25,000 sq.ft.
indus = proportion of non-retail business acres per town.
chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox = nitrogen oxides concentration (parts per 10 million).
rm = average number of rooms per dwelling.
age = proportion of owner-occupied units built prior to 1940.
dis = weighted mean of distances to five Boston employment centres.
rad = index of accessibility to radial highways.
tax = full-value property-tax rate per $10,000.
ptratio = pupil-teacher ratio by town.
black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat = lower status of the population (percent).
medv = median value of owner-occupied homes in $1000s.
In this week's exercise we use the Boston data set from the R package MASS, which contains historical data collected from 506 districts in the area around Boston.
Boston has 14 variables and 506 observations. The crime rate (crim) is the response variable.
The variables and their explanations are shown above.
#Dataset summary and variables
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
colnames(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
pairs(Boston)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
#Graphical summary of crime variable
ggplot(Boston, aes(crim)) + stat_density() + theme_bw()
#Plotting each variable against crime rate
bosmelt <- melt(Boston, id="crim")
ggplot(bosmelt, aes(x=value, y=crim))+
facet_wrap(~variable, scales="free")+
geom_point()
boxplot(Boston$crim, Boston$zn, Boston$indus, Boston$chas, Boston$nox, Boston$rm, Boston$age, Boston$dis, Boston$rad, Boston$tax, Boston$ptratio, Boston$black, Boston$lstat, Boston$medv, names = c("crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "black", "lstat", "medv"))
mlm <- lm(formula = crim ~ ., data = Boston)
summary(mlm)
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.924 -2.120 -0.353 1.019 75.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.033228 7.234903 2.354 0.018949 *
## zn 0.044855 0.018734 2.394 0.017025 *
## indus -0.063855 0.083407 -0.766 0.444294
## chas -0.749134 1.180147 -0.635 0.525867
## nox -10.313535 5.275536 -1.955 0.051152 .
## rm 0.430131 0.612830 0.702 0.483089
## age 0.001452 0.017925 0.081 0.935488
## dis -0.987176 0.281817 -3.503 0.000502 ***
## rad 0.588209 0.088049 6.680 6.46e-11 ***
## tax -0.003780 0.005156 -0.733 0.463793
## ptratio -0.271081 0.186450 -1.454 0.146611
## black -0.007538 0.003673 -2.052 0.040702 *
## lstat 0.126211 0.075725 1.667 0.096208 .
## medv -0.198887 0.060516 -3.287 0.001087 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared: 0.454, Adjusted R-squared: 0.4396
## F-statistic: 31.47 on 13 and 492 DF, p-value: < 2.2e-16
The most significant variables in the model are dis and rad (highly significant), medv (moderately significant), and zn and black, which are less significant but still below p = 0.05.
cor_matrix<-cor(Boston)
cor_matrix %>% round(digits = 2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
corrplot.mixed(cor_matrix, number.cex = .6)
The corrplot shows the relationships between the variables. The highest positive correlations are between rad and tax, indus and nox, and age and nox. The highest negative correlations are between age and dis, lstat and medv, and dis and nox. We can also see from the summaries that the scales of the variables are very inconsistent, so we need to standardize the data set before doing linear discriminant analysis later.
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Standardization centers and scales the data: each continuous variable is put on a unit scale by subtracting the variable's mean and dividing by its standard deviation. After this, each variable has mean 0 and standard deviation 1.
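What scale() does can be verified by hand on a toy vector: subtract the mean, divide by the standard deviation, and the result has mean 0 and SD 1:

```r
# scale() subtracts the mean and divides by the standard deviation;
# checked by hand on a small toy vector.
v <- c(2, 4, 6, 8)
z <- (v - mean(v)) / sd(v)
cbind(by_hand = z, by_scale = as.numeric(scale(v)))
```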
# creating a quantile vector of crim
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
#removing crim
boston_scaled <- dplyr::select(boston_scaled, -crim)
#adding categorical variable to the table
boston_scaled <- data.frame(boston_scaled, crime)
For prediction we need a training set for the model, here chosen to be 80% of the cases; the rest of the data is used as a test set, with which the accuracy of the model can be assessed.
n <- nrow(boston_scaled)
#Choosing 80% to the training set
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
# creating the test set
test <- boston_scaled[-ind,]
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2648515 0.2500000 0.2400990 0.2450495
##
## Group means:
## zn indus chas nox rm
## low 1.01626464 -0.8965988 -0.08835242 -0.8942294 0.46705380
## med_low -0.09837428 -0.2815210 0.03952046 -0.5609621 -0.17104470
## med_high -0.37054365 0.1983494 0.13355755 0.4354556 0.08566594
## high -0.48724019 1.0171737 -0.03371693 1.0639214 -0.36325665
## age dis rad tax ptratio
## low -0.9150665 0.8596107 -0.6877778 -0.7362074 -0.49921221
## med_low -0.3080388 0.3512236 -0.5429521 -0.4897092 -0.08830576
## med_high 0.3817827 -0.3674527 -0.4407893 -0.3068249 -0.24327061
## high 0.8657740 -0.8685989 1.6375616 1.5136504 0.78011702
## black lstat medv
## low 0.37617542 -0.77413000 0.56416552
## med_low 0.32058485 -0.08732011 -0.03812375
## med_high 0.07012611 0.00507139 0.12466697
## high -0.73352939 0.93217720 -0.75687837
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.103946328 0.701368161 -0.923301885
## indus 0.005826649 -0.169098074 0.267157585
## chas -0.089461281 -0.003019469 0.184356918
## nox 0.379209305 -0.857015004 -1.420230222
## rm -0.085241425 -0.066342304 -0.172456555
## age 0.197533727 -0.310353360 0.006552849
## dis -0.093239721 -0.296769577 0.199030896
## rad 3.381565548 1.065734828 0.099077042
## tax 0.037747247 -0.117480078 0.359357651
## ptratio 0.128956908 -0.049849868 -0.390652825
## black -0.103229646 0.028183335 0.123811796
## lstat 0.265876391 -0.151753540 0.422870356
## medv 0.196375418 -0.377687799 -0.239479042
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9502 0.0363 0.0135
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
# saving the correct classes from test data
correct_classes <- test$crime
# removing the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 9 9 2 0
## med_low 5 16 4 0
## med_high 0 8 19 2
## high 0 0 0 28
From the cross table we can see that the high class is predicted very well, but more errors occur in the lower classes.
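One way to summarize the cross table is the overall accuracy, the share of the diagonal. Since the train/test split is random, the exact counts below hold only for the run shown above; here they are rebuilt by hand so the snippet runs standalone:

```r
# Rebuild the cross table above (correct classes on rows, predictions
# on columns) and compute overall accuracy as the diagonal share.
tab <- matrix(c(9,  9,  2,  0,
                5, 16,  4,  0,
                0,  8, 19,  2,
                0,  0,  0, 28),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))
accuracy <- sum(diag(tab)) / sum(tab)
accuracy
```

For this particular split the accuracy is about 0.71, with the high class contributing no errors at all.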
# Boston dataset reading and standardization again
data("Boston")
b_boston_scaled <- scale(Boston)
# Distances with euclidean distance
dist_eu <- dist(b_boston_scaled)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
km <- kmeans(b_boston_scaled, centers = 3)
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(b_boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
The optimal number of clusters is the point where the total within-cluster sum of squares (WCSS) drops sharply. Here that seems to be two.
# Clustering again
km2 <- kmeans(b_boston_scaled, centers = 2)
pairs(b_boston_scaled[,1:7], col = km2$cluster)
pairs(b_boston_scaled[,8:14], col = km2$cluster)